ABSTRACT



A web-based voice dialog interface for use in communicat- 
ing dialog information between a user at a client machine 
and one or more servers coupled to the client machine via 
the Internet or other computer network. The interface in an 
illustrative embodiment includes a web page interpreter for 
receiving information relating to one or more web pages. 
The web page interpreter generates a rendering of at least a 
portion of the information for presentation to a user in an 
audibly-perceptible format. A grammar processing device 
utilizes interpreted web page information received from the 
web page interpreter to generate syntax information and 
semantic information. A speech recognizer processes 
received user speech in accordance with the syntax 
information, and a natural language interpreter processes the 
resulting recognized speech in accordance with the seman- 
tics information to generate output for delivery to a web 
server in conjunction with a voice dialog which includes the 
user speech and the rendering of the web page(s). The output 
may be processed by a common gateway interface (CGI) 
formatter prior to delivery to a CGI associated with the web 
server. 

20 Claims, 3 Drawing Sheets

WEB-BASED VOICE DIALOG INTERFACE

PRIORITY CLAIM 

The present application claims the priority of U.S.
Provisional Application No. 60/135,130 filed May 20, 1999 and
entitled "Web-Based Voice Dialog Interface."

FIELD OF THE INVENTION 

The present invention relates generally to the Internet and
other computer networks, and more particularly to tech- 
niques for communicating information over such networks 
via an audio interface. 

BACKGROUND OF THE INVENTION

The continued growth of the Internet has made it a 
primary source of information on a wide variety of topics. 
Access to the Internet and other types of computer networks 
is typically accomplished via a computer equipped with a 
browser program. The browser program provides a graphi- 
cal user interface which allows a user to request information 
from servers accessible over the network, and to view and 
otherwise process the information so obtained. Techniques 
for extending Internet access to users equipped with a 
telephone or other type of audio interface device have been 
developed, and are described in, for example, D. L. Atkins 
et al., "Integrated Web and Telephone A Language Interface 
to Networked Voice Response Units," Workshop on Internet 
Programming Languages, ICCL '98, Loyola University,
Chicago, Ill., May 1998, both of which are incorporated by
reference herein. 

Current approaches to web-based voice dialog generally
fall into two categories. The first category includes those
approaches that use HyperText Markup Language (HTML)
and extensions such as Cascading Style Sheets (CSS) to 
redefine the meaning of HTML tags. 

The second of the two categories noted above includes
those approaches that utilize a new language specialized for
voice interfaces, such as Voice eXtensible Markup Language
(VoiceXML) from the VoiceXML Forum (which includes
Lucent, AT&T and Motorola), Speech Markup Language
(SpeechML) from IBM, or Talk Markup Language
(TalkML) from Hewlett-Packard. These languages may be
viewed as presentation mechanisms that address primarily
the syntactic issues of the voice interface. The semantics of
voice applications on the web are generally handled using
custom solutions involving either client-side programming
such as Java and Javascript or server-side methods such as
Server-Side Include (SSI) and Common Gateway Interface
(CGI) programming. In order to create a rich dialog interface
to a computer application using these language-based
approaches, an application developer generally must write
explicit specifications of the sentences to be understood by
the system, such that the actual spoken input can be
transformed into the equivalent of a mouse-click or keyboard
entry to a web form.

Examples of web-based voice dialog systems are
described in U.S. patent application Ser. No. 09/168,405,
filed Oct. 6, 1998 in the name of inventors M. K. Brown et
al. and entitled "Web-Based Platform for Interactive Voice
Response," which is incorporated by reference herein. More
specifically, this application discloses an Interactive Voice
Response (IVR) platform which includes a speech
synthesizer, a grammar generator and a speech recognizer.
The speech synthesizer generates speech which characterizes
the structure and content of a web page retrieved over
the network. The speech is delivered to a user via a telephone
or other type of audio interface device. The grammar
generator utilizes textual information parsed from the 
retrieved web page to produce a grammar. The grammar is 
then supplied to the speech recognizer and used to interpret 
voice commands generated by the user. The grammar may 
also be utilized by the speech synthesizer to create phonetic 
information, such that similar phonemes are used in both the 
speech recognizer and the speech synthesizer. 

The speech synthesizer, grammar generator and speech 
recognizer, as well as other elements of the IVR platform, 
may be used to implement a dialog system in which a dialog 
is conducted with the user in order to control the output of 
the web page information to the user. A given retrieved web 
page may include, for example, text to be read to the user by 
the speech synthesizer, a program script for executing opera- 
tions on a host processor, and a hyperlink for each of a set 
of designated spoken responses which may be received from 
the user. The web page may also include one or more 
hyperlinks that are to be utilized when the speech recognizer 
rejects a given spoken user input as unrecognizable. 

Despite the advantages provided by the existing 
approaches described above, a need remains for further 
improvements in web-based voice dialog interfaces. More 
specifically, a need exists for a technique which can provide 
many of the advantages of both categories of approaches, 
while avoiding the application development difficulties 
often associated with the specialized language based 
approaches. 

SUMMARY OF THE INVENTION 

The present invention provides an improved voice dialog 
interface for use in web-based applications implemented
over the Internet or other computer network. 

In accordance with the invention, a web-based voice 
dialog interface is configured to communicate information 
between a user at a client machine and one or more servers 
coupled to the client machine via the Internet or other 
computer network. The interface in an illustrative embodi- 
ment includes a web page interpreter for receiving informa- 
tion relating to one or more web pages. The web page 
interpreter generates a rendering of at least a portion of the 
information for presentation to a user in an audibly- 
perceptible format. The web page interpreter may make use 
of certain pre-specified voice-related tags, e.g., HTML 
extensions. A grammar processing device utilizes interpreted 
web page information received from the web page inter- 
preter to generate syntax information and semantic infor- 
mation. A speech recognizer processes received user speech 
in accordance with the syntax information, and a natural 
language interpreter processes the resulting recognized 
speech in accordance with the semantics information to 
generate output for delivery to a web server in conjunction 
with a voice dialog which includes the user speech and the 
rendering of the web page(s). The output may be processed 
by a common gateway interface (CGI) formatter prior to 
delivery to a CGI associated with the web server. 

The grammar processing device may include a grammar 
compiler, and may implement a grammar generation process 
to generate a grammar specification language which is 
supplied as input to a grammar compiler. The grammar 
generation process may utilize a thesaurus to expand the 
grammar specification language. 
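
As a rough illustration of the thesaurus-based expansion mentioned above, the following Python sketch rewrites each word of a phrase as a parenthesized alternation of the word and its synonyms, separated by vertical bars. The thesaurus entries and the function name expand_with_thesaurus are hypothetical and are given only as an example; they are not taken from any actual grammar generation process.

```python
# Hypothetical thesaurus; the entries are illustrative assumptions only.
THESAURUS = {
    "move": ["shift"],
    "rotate": ["turn", "spin"],
}


def expand_with_thesaurus(phrase, thesaurus=THESAURUS):
    """Return a string in which each word that has synonyms becomes an
    alternation such as (rotate|turn|spin); other words are unchanged."""
    parts = []
    for word in phrase.lower().split():
        synonyms = thesaurus.get(word, [])
        if synonyms:
            parts.append("(" + "|".join([word] + synonyms) + ")")
        else:
            parts.append(word)
    return " ".join(parts)


print(expand_with_thesaurus("rotate the green cup"))
# (rotate|turn|spin) the green cup
```

A grammar compiler given the expanded string would then accept "turn the green cup" as well as the original wording.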

In accordance with another aspect of the invention, the 
web page interpreter may further generate a client library 
associated with interpretations of web pages previously
performed on a common client machine. The client library
will generally include a script language definition of seman- 
tic actions, and may be utilized by a web server in generating 
an appropriate response to a user speech portion of a dialog. 

In accordance with a further aspect of the invention, 
dialog control may be handled by representing a given 
dialog turn in a single web page. In this case, a finite-state 
dialog controller may be implemented as a sequence of web 
pages each representing a dialog turn. 
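
The idea of a finite-state dialog controller built from web pages can be sketched in a few lines of Python. In this minimal sketch each page is one dialog turn, holding a prompt to be spoken and a set of hyperlinks keyed by the spoken command that follows them. The page names, prompts and commands are illustrative assumptions, and a real controller would fetch each page from a web server rather than from a dictionary.

```python
# Each "page" stands for one dialog turn: a prompt to speak plus the
# hyperlinks (spoken commands) that lead to the next page.
# Page names, prompts and commands are illustrative assumptions.
PAGES = {
    "welcome": {"prompt": "Welcome.", "links": {"start": "task", "help": "help"}},
    "help": {"prompt": "How to use this service.", "links": {"start": "task"}},
    "task": {"prompt": "What would you like to do?", "links": {}},
}


def next_page(current, spoken_command, pages=PAGES):
    """Follow the hyperlink matching the spoken command; stay on the
    current page if the command matches no link on that page."""
    return pages[current]["links"].get(spoken_command.lower(), current)


def run_dialog(commands, start="welcome", pages=PAGES):
    """Return the sequence of pages visited for a list of commands."""
    visited = [start]
    for command in commands:
        visited.append(next_page(visited[-1], command, pages))
    return visited


print(run_dialog(["help", "start"]))
# ['welcome', 'help', 'task']
```

Because the dialog state is simply the current page, no state needs to be kept outside the pages themselves.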

In accordance with yet another aspect of the invention, the 
processing operations of the web-based voice dialog inter- 
face are associated with an application developed using a 
dialog application development tool. The dialog application 
development tool may include an authoring tool which (i) 
utilizes a grammar specification language to generate output 
in a web page format for delivery to one or more clients, and 
(ii) parses code to generate a CGI output for delivery to the 
web server. 

Advantageously, the techniques of the invention allow a 
voice dialog processing system to reduce client-server traffic 
and perform immediate execution of client-side operations. 
Other advantages include less computational burden on the 
web server, the elimination of any need for specialized 
natural language knowledge at the web server, a simplified 
interface, and unified control at both the client and the 
server. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of an illustrative web-based 
processing system which includes a voice dialog interface in 
accordance with the invention. 

FIG. 2 illustrates a finite-state dialog process involving a 
set of web pages and implemented using the web-based 
processing system of FIG. 1. 

FIG. 3 illustrates the operation of a web-based dialog 
application development tool in accordance with the inven- 
tion. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The present invention will be illustrated below in conjunction
with an exemplary web-based processing system. It
should be understood, however, that the invention is not 
limited to use with any particular type of system, network, 
network communication protocol or configuration. The term 
"web page" as used herein is intended to include a single 
web page, a set of web pages, a web site, and any other type 
or arrangement of information accessible over the World 
Wide Web, over other portions of the Internet, or over other 
types of communication networks. The term "processing 
system" as used herein is intended to include any type of 
computer-based system or other type of system which 
includes hardware and/or software elements configured to 
provide one or more of the voice dialog functions described 
herein. 

The present invention in an illustrative embodiment auto- 
mates the application development process in a web-based 
voice dialog interface. The interface in the context of the 
illustrative embodiment will be described herein using a 
number of extensions to conventional HyperText Markup 
Language (HTML). It should be noted that, although the 
illustrative embodiment utilizes HTML, the invention can be 
implemented in conjunction with other languages, e.g., 
Phone Markup Language (PML), Voice eXtensible Markup
Language (VoiceXML), Speech Markup Language 
(SpeechML), Talk Markup Language (TalkML), etc.

HTML Extensions

The above-noted HTML extensions may be embedded in 
the scope of an HTML anchor as follows: 

<A HREF="URL" special_tags>title</A>
where URL represents the Uniform Resource Locator and
title is the string of mouse-sensitive words of the hyperlink.
The special_tags are generally ignored by conventional
visual web browsers that are not designed to recognize them,
but have special meaning to voice browsers, such as the
PhoneBrowser built on the Lucent Speech Processing System
(LSPS) platform developed by Lucent Technologies Inc.
of Murray Hill, N.J. Examples of the special tags include the
following:



[tag name illegible]       Inhibits Text-to-Speech (TTS) processing of the
                           title of this link, making it silent.
VOICE="parameters"         Set parameters for voice synthesis.
IGNORETITLE                Inhibits Automatic Speech Recognition (ASR)
                           processing of the title of this link; usually
                           used with Grammar Specification Language (GSL).
NOPERMUTE                  Inhibits combinatoric processing of the title of
                           this link for ASR; forces the user to speak the
                           entire title.
LSPSGSL="string"           Defines a GSL grammar to be used by ASR for this
                           link. This must use the LSPS syntax, and is
                           platform-dependent.
LSPSGSLHREF="URL"          Defines a GSL grammar, as above, obtained from a
                           URL.
DISOVERRIDE                Causes the link title to take precedence over
                           normal anchor titles during disambiguation,
                           including built-in PhoneBrowser commands. If
                           several items specify DISOVERRIDE then
                           disambiguation will take place among them.
PRIORITY=#                 Set the command priority level; higher #'s take
                           precedence.
URLINSERT                  Causes the ASR or DTMF response string triggering
                           this anchor to be inserted in the URL in place of
                           a "%s". Typically used in a QUERY_INFO string.
BARGEIN={"ON"|"OFF"}       Turn barge-in on or off (default is on).
INITIALTIMEOUT=seconds     Specify how many seconds can elapse from the time
                           the recognizer is started to the time the user
                           starts speaking. If no speech starts by this
                           time, the URL (required) is taken.
GAPTIMEOUT=seconds         Specify how many seconds can elapse from the time
                           the user stops speaking to the time that
                           recognition takes place. If nothing is recognized
                           during this time, it is presumed that the
                           utterance was not recognized, and the URL
                           (required) is taken. A default value of two
                           seconds is normally supplied, and this should be
                           specified only in special circumstances.
MAXTIMEOUT=seconds         Specify how many seconds can elapse from the time
                           the recognizer is started to the time that
                           recognition takes place. If no speech starts by
                           this time, or nothing has been recognized, the
                           URL (required) is taken.
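
To make the role of the special tags more concrete, the following Python sketch shows one way a voice browser might read them from an anchor and apply the URLINSERT rule. It relies only on the standard html.parser and urllib.parse modules. The class name AnchorTagReader, the function resolve_url, the example URL, and the choice to URL-encode the recognized string (blanks become "+") are all assumptions made for illustration, not a description of the PhoneBrowser implementation.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus


class AnchorTagReader(HTMLParser):
    """Collect the attributes of each <A> element, including valueless
    voice tags such as URLINSERT, which appear with a value of None."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # HTMLParser lowercases tag and attribute names.
            self.anchors.append(dict(attrs))


def resolve_url(anchor, recognized_text):
    """Apply the URLINSERT rule: put the recognized string in place of
    "%s" in the HREF. URL-encoding the string is an assumption here."""
    href = anchor["href"]
    if "urlinsert" in anchor:
        href = href.replace("%s", quote_plus(recognized_text), 1)
    return href


reader = AnchorTagReader()
reader.feed('<A HREF="http://host/path?%s" URLINSERT BARGEIN="OFF">Title</A>')
reader.close()
anchor = reader.anchors[0]
print(anchor)
print(resolve_url(anchor, "rotate the green cup"))
```

A conventional visual browser would simply skip the unknown attributes, which is why the same page can serve both kinds of client.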









Three of the above-listed tags form the basis for defining 
a language interface that is richer than simple hyperlink 
titles. For the LSPS platform, which will be used in the 
illustrative embodiment, these are LSPSGSL, 
LSPSGSLHREF, and URLINSERT. The first two allow the 
specification of a rich speech recognition (SR) grammar and 
vocabulary. In a more general purpose implementation, 
these might be replaced with other tags, such as GRAMMAR
and GRAMHREF, respectively, as described in the
above-cited U.S. patent application Ser. No. 09/168,405.
The third tag, URLINSERT, allows arbitrary SR output to be
communicated to a web server through a Common Gateway
Interface (CGI) program. As will be described in greater
detail below, these extensions provide the basis for a more
powerful set of web-based speech application tools.

The above-listed IGNORETITLE and NOPERMUTE
tags will now be described in greater detail. The current
implementation of PhoneBrowser normally processes
hyperlink titles to automatically generate navigation command
grammars. The processing involves computing all
possible combinations of meaningful words of a title (i.e.,
simple function words like "the," "and," etc. are not used in
isolation), thereby allowing word deletions so that the user
may speak some, and not all, of the words in a title phrase.
This simple language model expansion mechanism gives the
user some flexibility to speak a variety of commands to
obtain the same results. The IGNORETITLE tag causes the
system to inhibit all processing of the hyperlink title. This is
usually only useful when combined with one of the grammar
definition tags, but may also be used for certain timeout
effects. The NOPERMUTE tag inhibits processing of the
title word combinatorics, making only the full explicit title
phrase available in the speech grammar.

It should be understood that the above-described tags are
shown by way of illustrative example only, and should not
be construed as limiting the invention in any way. Other
embodiments of the invention may utilize other types of
tags.

Unified Syntactic/Semantic Specifications

Conventional methods for creating web-based speech
applications generally involve design of speech grammars
for SR and the design of a natural language command
interpreter to process the SR output. Grammars are usually
defined in finite-state form but are sometimes expressed as
context-free grammars (CFGs). Natural language interpreters
generally include a natural language parser and an
execution module to perform the actions specified in the
natural language input. This combination provides the basic
mechanism for processing a discourse of spoken utterances.
Discourse, in this case, is defined as a one-sided sequence of
expressions, e.g., one agent speaking one or more sentences.

Many existing SR products use a grammar definition
language called Grammar Specification Language (GSL).
GSL in its original versions was generally limited to syntactic
definition. Later versions of GSL incorporate semantic
definitions into the syntactic specification. The resulting
grammar compiler automatically creates the command interpreter
as well as the finite-state or CFG representation of the
language syntax.

In accordance with the present invention, the process of
developing web-based speech applications can be automated
by using an extension of these principles for HTML-based
speech applications.

Original semantic GSL expressions take the following
example form, from a robot control grammar described in
M. K. Brown, B. M. Buntschuh and J. G. Wilpon, "SAM: A
Perceptive Spoken Language Understanding Robot," IEEE
Trans. SMC, Vol. 22, No. 6, pp. 1390-1402, September
1992, which is incorporated by reference herein:

{(move[Move]|rotate[Rotate]) the {1 (red|green)(cup|block)}}.

In this example, each statement is a sentence. Each word
could become a phrase in a more general example. Parentheses
enclose exclusive OR forms, where each word or
phrase is separated by vertical bars, and these expressions
can be nested. Square brackets contain the name of a C
function that will be called when the adjoining word (or
phrase) is spoken in this sentence. Curly brackets enclose
argument strings that will be sent to the C function. When
the user says "rotate the green cup" the outcome is the C
function call:

Rotate("green cup");

Another way to implement semantic actions is to use a
dispatch function as follows:

{[Exec]{0 (move|rotate)} the {1 (red|green)(cup|block)}}.

In this case, the dispatch function Exec is called with
argument 0 set to "rotate," thereby signaling Exec to call the
Rotate function.

The specification form is very general. C functions can be
defined anywhere in a sentence statement and arguments
can be arbitrarily nested (even reusing the
same text repeatedly). Functions defined within the scope of
an argument in the scope of another function will return a
computed argument value to the enclosing function at
execution time. Hence, a complete function call tree is
created.

The simple example given above only specifies six sentence
possibilities. More typical definitions would specify
complex syntax and semantics having many thousands of
sentence possibilities (the full robot grammar for this
example specified 6×10^20 sentences in about 1.5 pages of
GSL code).

The actual GSL implementation is also more complicated
than illustrated here. The compiler performs macro
expansion, takes cyclic and recursive expressions, performs
recursion transformations, performs four stages of
optimization, and generates syntactic and semantic parsers.
The semantic function interface follows the Unix protocol
using the well-known Unix func(argc, argv) format. The
semantic parser can be separated from the syntactic parser
and used as a natural language keyboard interface.

Lexicon Driven Semantics

It was shown that semantic specification expressions can be
written by attaching C functions to verbs while collecting
adjectives and nouns into arguments. In accordance with the
invention, this process can be simplified further for the
application developer by providing a natural language lexicon
containing word classifications. This lexicon can either
reside in the client (e.g., in a browser) or in a web server.
Using the above-noted URLINSERT mechanism that
inserts an SR output string directly into a URL, a server-side
lexicon would generally be needed. Each HTML page may
use a different lexicon and it is desirable to share lexicons
across many servers, so a lexicon may reside on a server
different from the semantics-processing server. With a minor
extension of the URLINSERT mechanism the lexicon information
could be sent to the server using the POST mechanism
of the HyperText Transfer Protocol (HTTP). However,
this approach puts an increased burden on the server. A
server-side solution using a variety of such lexicons is also
inconsistent with the stateless nature of existing web server
technology.

Lexicon driven semantics generally require a higher level
representation of language structure. Phrase structure grammar
variables are used to define the sentence structure,
which can be broken down into more detailed descriptions,
eventually leading to word categories. Word categories are
typically parts of speech such as noun, adjective and verb
designators. Parsing of a sentence is performed bottom up
until a complete phrase structure is recognized. The semantics
are then extracted from the resultant parse tree. Verb
phrases are mapped into semantic actions while noun
phrases are mapped into function arguments.

Client-Side Semantics

Converting syntax to semantics at the client has a number
of advantages, including: less computational burden on the
web server; distribution of computation to clients; no need
for specialized knowledge of natural language at the server;
a simplified interface; unified control at both the client and
server; and fast response to local commands.

FIG. 1 shows a processing system 100 which implements
a web-based voice dialog interface in accordance with the
illustrative embodiment of the invention. The portions of the
system 100 other than web server 128 are assumed for this
example to be implemented on the client-side, e.g., in a
browser associated with a client computer or other type of
client processing device. A client in accordance with the
invention may be any type of computer, computer system,
processing device or other type of device, e.g., a telephone,
a television set-top box, a computer equipped with telephony
features, etc., capable of receiving and/or transmitting
audio information.

The client-side portions of the system 100 are assumed to
be coupled to the web server 128 via a conventional network
connection, e.g., a connection established over a network in
a conventional manner using the Transmission Control
Protocol/Internet Protocol (TCP/IP) standard or other suitable
communication protocol(s).

The system 100 receives HTML information from the
Internet or other computer network in an HTML interpreter
102 which processes the HTML information to generate a
rendering 104, i.e., an audibly-perceptible output of the
corresponding HTML information for delivery to a user. The
rendering 104 may include both visual and audio output. The
HTML information is also delivered to a grammar compiler
106 which processes the information to generate a syntax
110 and a set of lexical semantics 112. The grammar
compiler 106 may be of the type described in M. K. Brown
and J. G. Wilpon, "A Grammar Compiler for Connected
Speech Recognition," IEEE Trans. ASSP, Vol. 39, No. 1, pp.
17-28, January 1991, which is incorporated by reference
herein. The HTML interpreter 102 also generates a client
library 114.

It should be noted that the grammar compiler 106 may
incorporate or otherwise utilize a grammar generation
process, such as that described in greater detail in the
above-cited U.S. patent application Ser. No. 09/168,405,
filed Oct. 6, 1998 in the name of inventors M. K. Brown et
al. and entitled "Web-Based Platform for Interactive Voice
Response." For example, such a grammar generation process
can receive as input parsed HTML, and generate GSL
therefrom. The grammar compiler 106 may be configured to
take this GSL as input and create an optimized finite-state
network for a speech recognizer. More particularly, the GSL
may be used, e.g., to program the grammar compiler 106
with an expanded set of phrases so as to allow a user to speak
partial phrases taken from a hyperlink title. In addition, a
stored thesaurus can be used to replace words with synonyms
so as to further expand the allowed language.

The grammar compiler 106 is an example of a "grammar
processing device" suitable for use with the present invention.
Such a device in other embodiments may incorporate
a grammar generator, or may be configured to receive input
from a grammar generator.

In the system 100 of FIG. 1, speech received from a user
is processed in an automatic speech recognizer (ASR) 120
utilizing the syntax 110 generated by the grammar compiler
106. The output of the ASR is applied to a natural language
interpreter 122 which utilizes the lexical semantics 112
generated by the grammar compiler 106. The output of the
interpreter 122 is supplied to client executive 124
and CGI formatter 126, both of which communicate
with a web server 128. The client executive 124 processes
the interpreted speech from the interpreter 122 in accordance
with information in the client library 114. The client executive
124 can be one of a variety of interpreters, such as Java,
Javascript or VisualBasic interpreters. The CGI formatter
126 can also be written in one of these languages and
be [...] implemented as part of a client browser.

Although shown as separate elements in the system 100,
the ASR 120 and natural language interpreter 122 may be
different elements of a single speech recognition device.
Moreover, although illustrated as including a single web
server, the system 100 can of course be utilized in conjunction
with multiple servers in numerous different arrangements.

The incoming HTML information in the system 100 of
FIG. 1 is thus processed for multiple simultaneous purposes,
i.e., to generate the rendering 104, to extract a natural
language model containing both syntactic and semantic
information in the form of respective syntax 110 and lexical
semantics 112, and to generate a script language definition
of semantic actions via the client library 114.

Advantageously, extracting semantics on the client side in
the manner illustrated in FIG. 1 allows the system 100 to
reduce client-server traffic and perform immediate execution
of client-side operations.

The CGI format as implemented in the CGI formatter 126
will now be described in greater detail. A general URL
format suitable for use in calling a CGI in the illustrative
embodiment includes five components: protocol, host, path,
PATH_INFO, and QUERY_STRING, in the following
syntax:

{protocol}://{host}/{path}/{PATH_INFO}?{QUERY_STRING}

where protocol can generally be one of a number of known
protocols, such as, e.g., http, ftp, wais, etc., but for use with
a CGI the protocol is generally http; host is usually a fully
qualified domain name but may be relative to the local
domain; path is a slash-separated list of directories ending
with a recognized file; PATH_INFO is additional slash-separated
information that may contain a root directory for
CGI processing; and QUERY_STRING is an ampersand-separated
list of name-value pairs for use by a CGI program.
The last two items become available to the CGI program as
environment values in the environment of the CGI at the
web server 128. Processing of the URL by the client and web
server is as follows:

1. client connects to host (or sends complete URL to
proxy and proxy connects to host) web server;

2. client issues GET or POST using the remainder
of the URL after the host;

3. server parses path searching from the public filesystem
root until it recognizes a path element;

4. server continues parsing path until either end of string
or '?' token is seen, setting PATH_INFO; and

5. server sets QUERY_STRING with remaining URL
string. The URL may not contain white-space characters
but QUERY_STRING blanks can be represented
with "+" characters.

Continuing with the previous robot grammar example, for
server-side execution the speech grammar specification can
be written into a hyperlink:

<A HREF="http://host/pathinfo?%s" URLINSERT
GSL="{(move[Move]|rotate[Rotate]) the {1 (red|green)(cup|block)}}.">
Title</A>

In this example, the underlying platform has been
extracted from the grammar specification tag. The presence
of semantics in the GSL string indicates that the QUERY_INFO
string should contain a preprocessed semantic expression
rather than the unprocessed SR output string. In this
case, URLINSERT will result in analysis of the SR output
text yielding the URL:

http://host/pathinfo?EXEC="{Rotate+1'green+cup'}"

A concise format is used. The curly brackets delimit
scope. Argument numbers indicate argument positions, and
do not need to be in order or consecutive (i.e., some or all
arguments can be undefined). Nested functions can be
handled by nesting the call format as the following example
illustrates:

?EXEC="{func1+1{func2+1'arg1'+2'arg2'}}"

The function name does not need to appear first within the
execution scope, although it may be easier to read this style.

Execution on the client side would normally be limited by
security measures, since the content from the web server
may originate from an unreliable source. For purposes of
simplicity and clarity of illustration, however, such security
concerns will not be considered in the present description.
These concerns can be addressed using conventional security
techniques that are well-understood in the art.

On the client side, the Rotate operation is performed by
calling the Rotate function defined in the client library 114
of FIG. 1. The Rotate function can be defined in Java, for
example, and called upon receiving the appropriate speech
command.

Web-Based Dialog

The term "dialog" generally refers to a multi-sided
sequence of expressions. Handling dialog in a voice dialog
interface generally requires an ability to sequence through

FIG. 2 illustrates a finite state dialog controller 200 of this
type. The dialog controller 200 uses the HTML extensions
described previously. Controlled speech synthesis output of
a given web page is presented to a user, and the current
context of command grammar is defined and utilized, in a
manner similar to that previously described in conjunction
with FIG. 1.

The dialog controller 200 of FIG. 2 operates on
a set of web pages which include in this example web pages
202, 204, 206 and 208. Web page 202 is an HTML page
which represents a "Welcome" page, and includes "Start"
and "Help" hyperlinks. The "Help" hyperlink leads to web
page 204, which includes a "How to" section and a "Start"
hyperlink. The "Start" hyperlinks on pages 202 and 204 both
lead to page 206, which includes computed HTML corresponding
to an output of the form "I want to do {1 . . . } to
{2 . . . }." The web page 208 represents the next dialog turn.

In the controller 200, the HTML for a given dialog turn is
constructed using a CGI 210 which may be configured to
include application-specific knowledge. As shown in FIG. 2,
the CGI 210 interacts with a database interface (DBI) 212
and a database driver (DBD) 214. The DBI 212 is coupled
via the DBD 214 to a commercial database management
system (DBMS) 216. Suitable DBIs and DBDs are freely
available on the Internet for most of the popular commercial
DBMS products. The CGI 210 further interacts with an
application program interface (API) 218 to an underlying set
of one or more application(s) 220.

When a user speaks a client-side command, such as
"speak faster" or "speak louder," the command is executed
immediately and the presentation continues. When a navigation
command associated with a hyperlink is spoken,
control is transferred to the corresponding new web page,
dialog turn, and presentation and speech grammar context.
The process can then continue on to a new dialog state. In
this way, using many relatively small web pages, a complete
client-server dialog system can be created.

Condition Handling

Conditions are system states that prompt the interface
system or the application to take the initiative. Such a
mechanism was used in the SAM system described in the
above-cited M. K. Brown et al. reference. Additional details
regarding conditions in the context of dialog can be found in,
e.g., J. Chu-Carroll and M. K. Brown, "An evidential model

what is commonly called a dialog turn. A dialog turn may be for tracking initiative in collaborative dialogue interactions," 

defined as two or more "plys" in a dialog tree or other type User Modeling and User-Adapted Interaction Journal, Spe- 

of dialog graph necessary to complete an exchange of c ial Issue on Computational Models for Mixed Initiative 

information. A dialog graph refers generally to a finite-state Interaction, 1998; J. Chu-Carroll and M. IC Brown, "Initia- 

representation of a complete set of dialog exchanges 5Q UV e in collaborative interactions — Its cues and effects," In 

between two or more agents, and generally contains states Working Notes of the AAAI-97 Spring Symposium on 

and edges as does any mathematical graph. The dialog graph Computational Models for Mixed Initiative Interaction, 

may be virtual in the sense that the underlying implemen- pages 16-22, 1997; and J. Chu-Carroll and M. K. Brown, 

tation is rule-based, since rule-based systems maintain "Tracking initiative in collaborative dialogue interactions," 

"state" but may not be finite in scope. A "ply" is a discourse 5S ] n proceedings of the 35th Annual Meeting of the Associa- 

by one agent. When discussing dialogs of more than two tion for Computational Linguistics (ACL-97), pages 

agents, the conventional terminology "dialog turn" may be 262-270, 1997, all of which are incorporated by reference 

inadequate, and other definitions may be used. herein. 

It should be noted that web-based dialogs may model a Dialog system conditions may be used to trigger a dialog 

given computer or other processing device as a single agent $n manager to lake charge for a particular period, with the 

that may be multi-faceted, even though the actual system dialog manager subsequently relinquishing control as the 

may, include multiple servers. The primary, multi-faceted system returns to normal operation, 

agent may then serve as a portal to the underlying agents. Examples of condition types include the following: error 

In accordance with the invention, control of dialog for the conditions, task constraints, missing information, new 

single agent can be handled by representing a single two-ply 65 language, ambiguity, user confusion, more assistance 

dialog turn in a single HTML page. A sequence of such available, hazard warning, command confirmation, and hid- 

pages forms a finite-state dialog controller. den event explanation. 
These conditions can be created by the user, the system or
both, and are listed above in approximate order of severity. 
The first five conditions are severe enough to prevent 
processing of a command until the condition is addressed. 
User confusion is a more general condition that may prevent
further progress or may simply slow progress. The remain- 
ing conditions will not prevent progress but will prompt the 
system to issue declarative statements to the user. 
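The severity ordering described above can be sketched as a small classification. This is a minimal illustration only; the names `Condition` and `handling`, and the three labels it returns, are assumptions made for the example and do not appear in the specification.

```python
from enum import IntEnum


class Condition(IntEnum):
    """Condition types, numbered in the approximate order of severity."""
    ERROR = 1
    TASK_CONSTRAINT = 2
    MISSING_INFORMATION = 3
    NEW_LANGUAGE = 4
    AMBIGUITY = 5
    USER_CONFUSION = 6
    MORE_ASSISTANCE_AVAILABLE = 7
    HAZARD_WARNING = 8
    COMMAND_CONFIRMATION = 9
    HIDDEN_EVENT_EXPLANATION = 10


def handling(condition: Condition) -> str:
    """Return how a condition affects processing of the current command."""
    if condition <= Condition.AMBIGUITY:
        # The first five conditions must be addressed before the
        # command can be processed.
        return "block"
    if condition == Condition.USER_CONFUSION:
        # May prevent further progress or may simply slow it.
        return "may_block"
    # Remaining conditions only prompt a declarative statement.
    return "inform"
```

A dialog manager built along these lines could take charge for "block" conditions and simply speak a statement for "inform" conditions before returning control.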

Error conditions generally fall into three classes: application
errors, interface errors, and user errors. Application
errors occur when the application is given information or 
commands that are invalid in the current application state. 
For example, database information may be inconsistent with 
new data, etc. This kind of error needs to be handled by an 
application having knowledge of the associated processing,
but may also require additional HTML content to provide 
user feedback. For example, the user may be taken to a help 
system. 

Interface errors in this context are speech recognition 
errors that in many cases are easy for the user to correct by
simply issuing a designated command such as a "go back" 
command. In some cases, processing may not easily be 
reversed, so an additional confirmation step is advisable 
when speech recognition errors could be costly. Keeping the 
grammar context limited, whenever possible, decreases the
likelihood of recognition errors but can also create a variety 
of other problems when the user is prone to making a 
mistake about how the application functions.

A user command may be syntactically and semantically
correct but not possible because the application is unable to 
comply. Handling task constraints requires a tighter cou- 
pling between the application and the interface. In most 
cases, the application will need to signal the interface of 
inability to process a command and perhaps suggest ways
that the desired goal can be achieved. This signal may be at 
a low application level having no knowledge of natural 
language. The interface then must expand this low level 
signal into a complete natural language expression, perhaps 
initiating a side dialog to deal with the problem.

Often the user will provide only some of the information 
necessary to complete a task. For example, the user might 
tell a travel information agent that they "want to go to 
Boston." While the system might already know that the user 
is in, e.g., New York City, it is still necessary to know the
travel date(s), time of day, and possible ground transporta- 
tion desired. In this case, offering more assistance may be 
desirable, or simply asking for the needed information may 
suffice. 
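The missing-information case above can be illustrated with a short sketch. It is an assumption-laden example rather than the described implementation: the slot names in `REQUIRED_SLOTS` and the functions `missing_slots` and `next_prompt` are hypothetical, chosen to mirror the travel example in the text.

```python
# Hypothetical slots a travel task needs before it can be completed.
REQUIRED_SLOTS = ("origin", "destination", "travel_date", "time_of_day")


def missing_slots(known: dict) -> list:
    """Return the required slots that have no value yet, in order."""
    return [slot for slot in REQUIRED_SLOTS if not known.get(slot)]


def next_prompt(known: dict):
    """Ask for the first missing item, or return None when complete."""
    missing = missing_slots(known)
    if not missing:
        return None
    return "Please tell me your " + missing[0].replace("_", " ") + "."


# The user said "I want to go to Boston"; the system already knows
# that the user is in New York City.
known = {"origin": "New York City", "destination": "Boston"}
prompt = next_prompt(known)
```

Here `prompt` asks for the travel date first; a system that prefers to offer more assistance could instead list every entry returned by `missing_slots`.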

Occasionally the user will speak a new word or words that
the system has not heard before. This causes the interface to 
divert to a dialog about the new word(s). The user can be 
asked to tell the system the type of word (adjective, noun, 
verb, etc.) and possibly associate the new word with other 
words the system already knows about. Acquiring the acoustic
patterns of new words is also possible using phonetic
transcription grammars, with speech recognition, but is 
technically more difficult. 

It should be noted that commands can be ambiguous. The
system can handle this by listing a number of possible
explicit interpretations using, e.g., different words to express 
the same meaning or a more elaborate full description of the 
possible interpretations. The user can then choose an inter- 
pretation or rephrase the command and try again. 

User confusion may be detected by measuring user performance
parameters such as long response times, frequent
use of incomplete or ambiguous commands, lack of progress 
to a goal, etc. As such, user confusion is not detected quickly
by the system but is a condition that results from an 
averaging of user performance. As such a user confusion 
index slowly increases, the system should offer increasing 
levels of assistance, increasing the verbosity of conversa- 
tion. An expert user will thus be able to quickly achieve 
goals with low confusion scores. 
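One way to realize a slowly changing confusion index is an exponential moving average over per-turn signals. The sketch below is a hedged illustration: the class `ConfusionIndex`, the smoothing factor, the equal weighting of the three signals, and the thresholds are all assumptions chosen for the example, not values taken from the specification.

```python
class ConfusionIndex:
    """Running average of per-turn confusion signals (0.0 to 1.0)."""

    def __init__(self, smoothing: float = 0.2) -> None:
        self.smoothing = smoothing  # assumed weight given to each new turn
        self.value = 0.0

    def update(self, slow_response: bool, unclear_command: bool,
               no_progress: bool) -> float:
        """Fold one dialog turn into the average and return the index."""
        signal = (slow_response + unclear_command + no_progress) / 3.0
        self.value += self.smoothing * (signal - self.value)
        return self.value

    def assistance_level(self) -> int:
        """Map the index to a verbosity level (thresholds are assumed)."""
        if self.value < 0.25:
            return 0  # terse prompts for expert users
        if self.value < 0.6:
            return 1  # add hints
        return 2      # full explanations


index = ConfusionIndex()
index.update(slow_response=True, unclear_command=True, no_progress=True)
level_after_one_bad_turn = index.assistance_level()
```

Because each turn contributes only a fraction of its signal, a single bad turn leaves the assistance level unchanged, while a sustained run of bad turns raises it.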

Hazard warnings and command confirmation work 
together to protect the user and system from performing 
dangerous, possibly irreversible actions. Examples include 
changing database entries that remove previous data,
purchasing non-refundable airline tickets, etc. In many cases,
these actions may not be visible or obvious to the user, or it 
may be desirable to explain to the user not only what the 
system is doing on behalf of the user, but also how the 
system is doing it. 

It is usually important not to prevent the user from making 
mistakes by simply ignoring invalid requests, because the 
user will find it difficult to learn about such mistakes. 
Leaving all invalid commands out of the grammar for a 
given context may therefore result in user confusion. 
Instead, a well designed error handling system will recog- 
nize the erroneous command and send the user to a source 
of context-sensitive help for information on the proper use 
of commands in the current system state. User errors involv- 
ing misunderstanding of the application may require coop- 
eration between an application help system and an interface 
help system, since the user may not only be using the 
application incorrectly at a given point but have thereby 
arrived at an incorrect state in the dialog. The help facility 
then needs to know how to quickly get the user to the correct 
state and instruct the user on how to proceed. 

There are several ways the system can help the user either 
automatically or explicitly. Explicit requests for help can be 
handled either by a built-in help system that can offer 
general help about how to use the voice interface commands, 
or by navigating to a help site populated with HTML pages 
containing a help system dialog and/or CGI programs to 
implement a more sophisticated help interface. CGIs have
the additional advantage that the calling page can send its 
URL in the QUERY_STRING, thereby enabling the help 
dialog system to return automatically to the same place in 
the application dialog after the help system has completed its 
work. The QUERY_STRING information can also be used 
by the help system to offer context-sensitive help accessed 
from a global help system database. The user can also return 
to the application either by using a "go back" command or 
using a "go home" command to start over. 
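The return-address mechanism described above can be sketched with the Python standard library. This is an illustration under stated assumptions: the parameter name `return`, the example URLs, and the helper names `help_link` and `return_address` are hypothetical; only the idea of passing the calling page's URL in the QUERY_STRING comes from the text.

```python
from urllib.parse import parse_qs, urlencode


def help_link(help_cgi: str, calling_page: str) -> str:
    """Build a help URL that carries the calling page's URL with it."""
    return help_cgi + "?" + urlencode({"return": calling_page})


def return_address(query_string: str, default: str) -> str:
    """Recover the calling page from the QUERY_STRING seen by the CGI."""
    values = parse_qs(query_string).get("return")
    return values[0] if values else default


link = help_link("http://host/cgi-bin/help",
                 "http://host/app/page206.html")
# The help CGI receives everything after the "?" as its QUERY_STRING.
query_string = link.split("?", 1)[1]
back_to = return_address(query_string, default="http://host/")
```

The same decoded URL could also serve as a lookup key for context-sensitive help, since it identifies the dialog state the user came from.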

Using the above-described INITIALTIMEOUT, 
GAPTIMEOUT, and MAXTIMEOUT special_tags and a 
standard HTML <META HTTP-EQUIV="Refresh" . . .> tag,
the system can take the initiative when the user fails to 
respond or fails to speak a recognizable command within 
specified time periods. Each type of timeout can take the 
user to a specific part of a help system that explains why the 
system took charge and what the user can do next. 
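As a rough sketch of the timeout handling described above, a CGI could emit a standard refresh tag that points at a help page chosen per timeout type. The help page URLs and the names `HELP_PAGES`, `help_page_for`, and `refresh_tag` are assumptions for illustration; the syntax of the timeout tags themselves is defined elsewhere in the description and is not reproduced here.

```python
from html import escape

# Hypothetical help pages, one per timeout type named in the text.
HELP_PAGES = {
    "INITIALTIMEOUT": "http://host/help/initialtimeout.html",
    "GAPTIMEOUT": "http://host/help/gaptimeout.html",
    "MAXTIMEOUT": "http://host/help/maxtimeout.html",
}


def help_page_for(timeout_kind: str) -> str:
    """Return the help page that explains the given timeout."""
    return HELP_PAGES[timeout_kind]


def refresh_tag(seconds: int, url: str) -> str:
    """Build a META refresh tag that loads url after the given delay."""
    return '<META HTTP-EQUIV="Refresh" CONTENT="{}; URL={}">'.format(
        seconds, escape(url, quote=True))


tag = refresh_tag(30, help_page_for("MAXTIMEOUT"))
```

The refresh tag alone only reacts to elapsed time; distinguishing a silent user from an unrecognized utterance would still depend on the voice interface's own timeout handling.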

Dialog Application Development Tools 

The present invention also provides dialog application 
development tools, which help an application developer 
quickly build new web-based dialog applications. These 
tools may be implemented at least in part as extensions of 
conventional HTML authoring tools, such as Netscape Com- 
poser or Microsoft Word. 

A dialog application development tool in accordance with the
invention may, e.g., use the word classification lexicon
described earlier so as to allow default function assignments to
be made automatically while a grammar is being specified. The
application developer can then override these defaults with
explicit choices. Simultaneously, the tool can automatically
write code for parsing the QUERY_INFO strings containing the
encoded semantic expressions. This parsing code may then be
combined with a semantic transformation processor provided to the
developer as part of a web-based dialog system development kit
(SDK). Additional details regarding elements suitable for use in
such an SDK are described in, e.g., M. K. Brown and B. M.
Buntschuh, "A Context-Free Grammar Compiler for Speech
Understanding Systems," ICSLP'94, Vol. 1, pp. 21-24, Yokohama,
Japan, September 1994, which is incorporated by reference herein.

FIG. 3 illustrates the operation of a dialog application
development tool 300 in accordance with the invention. The
application development tool 300 includes an authoring tool 302
which utilizes GSL to generate an HTML output 304, and parses
included or called code to generate CGI output 306. The HTML
output 304 is delivered via Internet or other web service to a
client 310, e.g., to a browser program running on a client
computer. The CGI output 306 is delivered to a web server 128
which also has associated therewith an API 312 and a semantic
transformation processor 316. The web server 128 communicates
with the client 310 over a suitable network connection.

At execution time, the semantic transformation processor 316 runs
on the web server 128, e.g., as a module of the web server CGI
program, and it transforms the parsed semantic expressions from
the authoring tool 302 into calls to application functions that
perform semantic actions through the API 312. The API 312 may be
written using any of a variety of well-known languages. Language
interface definitions to be included in the CGI code can be
provided as part of the dialog application development tool for
the most popular languages, e.g., C, C++, Java, JavaScript,
Visual Basic, Perl, etc.

Automatic Language Model Expansion

One possible difficulty remaining for the application developer
is definition of all the ways a user might state each possible
command to the speech interface. Simple language model expansion,
as described previously, relaxes the constraints on the user
slightly, allowing the user to speak a variety of phrases
containing key words from the original title. Further language
model expansion can be obtained, e.g., by using a thesaurus to
substitute other words having similar meaning for words that
appeared in the original title. In addition, a hyperlink title
can be parsed into its phrase structure representation, and then
transformed into another phrase structure of the same type, e.g.,
interrogatory, assertion or imperative, from which more phrase
expressions can be derived.

The application developer can then write simple hyperlink title
statements representing the basic meaning assigned to that link,
using either a natural language expression (e.g., English
sentences as used in the above example) or a higher level
description using phrase structure grammar tags. When using
natural language, the system generally must first convert the
natural language into phrase structure form to perform structure
transformations. When using phrase structure format, the
application developer generally must use an intermediate level of
expression that specifies word classes or categories, so that the
system will know how to expand the phrase structure tokens into
natural language words.

This capability can be built into a dialog application
development tool, providing the application developer with a wide
variety of choices in developing new speech controlled web
content. In combination with existing web development tool
technology, this additional capability makes the development of
speech-activated web sites with rich dialog control easy to
implement for application developers who are not experts in
speech processing.

It should be noted that various evolving web-based voice browser
language proposals are now being considered by the World Wide Web
Consortium (W3C) Voice Browser Working Group. These emerging
standards may influence the particular implementation details
associated with a given embodiment of the invention.

The above-described embodiments of the invention are intended to
be illustrative only. Numerous alternative embodiments within the
scope of the following claims will be apparent to those skilled
in the art.

What is claimed is:

1. An apparatus for implementing a web-based voice dialog
interface, the apparatus comprising:

a first interpreter for receiving information relating to one or
more web pages, the first interpreter generating a rendering of
at least a portion of the information for presentation to a user
in an audibly-perceptible format;

a grammar processing device having an input coupled to an output
of the first interpreter, the grammar processing device utilizing
interpreted web page information received from the first
interpreter to generate syntax information and semantic
information;

a speech recognizer which processes user speech in accordance
with the syntax information generated by the grammar processing
device; and

a second interpreter having an input coupled to an output of the
speech recognizer, the second interpreter processing recognized
speech in accordance with the semantics information from the
grammar processing device to generate output for delivery to a
web server in conjunction with a dialog which includes at least a
portion of the rendering and the user speech.

2. The apparatus of claim 1 wherein the grammar processing device
comprises a grammar compiler.

3. The apparatus of claim 2 wherein the grammar processing device
implements a grammar generation process to generate a grammar
specification language which is supplied as input to the grammar
compiler.

4. The apparatus of claim 3 wherein the grammar generation
process utilizes a thesaurus to expand the grammar specification
language.

5. The apparatus of claim 1 wherein the first interpreter
comprises a web page interpreter capable of interpreting web
pages formatted at least in part using HTML.

6. The apparatus of claim 1 wherein the second interpreter
comprises a natural language interpreter.

7. The apparatus of claim 1 wherein the output generated by the
second interpreter is further processed by a common gateway
interface formatter prior to delivery to the web server.

8. The apparatus of claim 1 wherein the common gateway interface
formatter formats the output generated by the second interpreter
into a format suitable for a common gateway interface associated
with the web server.

9. The apparatus of claim 8 wherein the common gateway interface
is coupled to a database management system.

10. The apparatus of claim 1 wherein the first interpreter
further generates a client library associated with
interpretations of web pages previously performed on a common
client machine, the client library including a script language
definition of semantic actions.

11. The apparatus of claim 10 further including a client 
executive program which processes information in the client 
library for delivery to the web server. 

12. The apparatus of claim 1 wherein the web page 
information is at least partially in an HTML format. 

13. The apparatus of claim 12 wherein the first interpreter 
includes a capability for interpreting a plurality of voice- 
related HTML tags. 

14. The apparatus of claim 1 wherein dialog control is 
handled by representing a given dialog turn in a single web 
page. 

15. The apparatus of claim 14 wherein a finite state dialog
controller is implemented as a sequence of web pages each 
representing a dialog turn. 

16. The apparatus of claim 1 wherein the processing 
operations of the dialog are associated with an application 
developed using a dialog application development tool. 

17. The apparatus of claim 16 wherein the dialog appli- 
cation development tool comprises an authoring tool which 
utilizes a grammar specification language to generate output 
in a web page format for delivery to one or more clients, and 
parses code to generate a common gateway interface output
for delivery to the web server. 

18. A method for implementing a web-based voice dialog 
interface, the method comprising the steps of: 

generating a rendering of at least a portion of a set of 
information relating to one or more web pages received
over a network, for presentation to a user in an audibly- 
perceptible format; 

utilizing interpreted web page information to generate
syntax information and semantic information; 

processing user speech in accordance with the syntax 
information; and 

processing recognized speech in accordance with the 
semantics information to generate output for delivery to 
a web server in conjunction with a dialog which 
includes at least a portion of the rendering and the user 
speech. 






19. A machine-readable medium for storing one or more
programs for implementing a web-based dialog interface, 
wherein the one or more programs when executed by a 
processing system carry out the steps of: 

generating a rendering of at least a portion of a set of 
information relating to one or more web pages received 
over a network, for presentation to a user in an audibly-
perceptible format;

utilizing interpreted web page information to generate 
syntax information and semantic information; 

processing user speech in accordance with the syntax 
information to generate recognized speech; and 

processing the recognized speech in accordance with the 
semantics information to generate output for delivery to 
a web server in conjunction with a dialog which 
includes at least a portion of the rendering and the user 
speech. 

20. A processing system comprising: 

at least one computer for implementing at least a portion 
of a web-based voice dialog interface, the interface
including: (i) a first interpreter for receiving informa- 
tion relating to one or more web pages, the first 
interpreter generating a rendering of at least a portion 
of the information for presentation to a user in an 
audibly-perceptible format; (ii) a grammar processing 
device having an input coupled to an output of the first 
interpreter, the grammar processing device utilizing 
interpreted web page information received from the 
first interpreter to generate syntax information and 
semantic information; (iii) a speech recognizer which 
processes user speech in accordance with the syntax 
information generated by the grammar processing 
device; and (iv) a second interpreter having an input 
coupled to an output of the speech recognizer, the 
second interpreter processing recognized speech in 
accordance with the semantics information from the 
grammar processing device to generate output for 
delivery to a web server in conjunction with a dialog 
which includes at least a portion of the rendering and 
the user speech. 